Abstractions for Machine Learning
Abstract
Based on my experience in building systems and developing high-performance algorithms, I learned that using an appropriate programming model is crucial both for improving usability and for achieving high performance. Early efforts to implement large-scale machine learning algorithms were based on the data-parallel model provided by systems like MapReduce. However, it is often cumbersome to write complex machine learning algorithms in data-parallel models; many of these algorithms are best formulated as linear algebra operations on arrays. In Presto [10] we developed a distributed array-based abstraction which gives algorithm designers fine-grained control over computation and communication. We extended this work to build distributed data frames in SparkR [14]. Distributed data frames further allow users to perform structured data processing and to encode partition-aggregate workflows.

While distributed data structures are useful for expressing a single machine learning algorithm, a number of real-world applications are more complex and require combining multiple algorithms. For example, a text classification program might featurize data using TF-IDF scores, then perform dimension reduction using PCA, and finally learn a model using logistic regression. To address this, we developed the idea of machine learning pipelines [5], which allow users to compose simple operators. Along with this, we also created a library of algorithms [3] and linear algebra operators [16] that are now part of Apache Spark.
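As a concrete illustration of such a pipeline, the sketch below composes TF-IDF featurization, PCA, and logistic regression using Spark ML's pipeline API. This is a minimal sketch rather than the code from [5]; the sample data, column names, and parameter values are illustrative assumptions.

```python
# A minimal sketch of the TF-IDF -> PCA -> logistic regression pipeline
# described above, written against Spark ML's pipeline API. The sample
# data, column names, and parameter values are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, PCA
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("text-pipeline-sketch").getOrCreate()

# Hypothetical training data: (label, text) pairs.
train = spark.createDataFrame(
    [(0.0, "spark makes distributed arrays easy"),
     (1.0, "logistic regression learns a linear model")],
    ["label", "text"])

# Each stage declares only its input and output columns, so stages
# compose without knowing about one another.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 10)
idf = IDF(inputCol="tf", outputCol="tfidf")
pca = PCA(k=2, inputCol="tfidf", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Composing simple operators into a single pipeline, as in the text.
pipeline = Pipeline(stages=[tokenizer, tf, idf, pca, lr])
model = pipeline.fit(train)
model.transform(train).select("label", "prediction").show()
```

Because each stage only declares its input and output, the composed pipeline can be fit and applied as a single operator, which is what makes whole-pipeline optimization of the kind discussed below possible.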
Beyond Analytics

While my dissertation is primarily focused on analytics applications, I am broadly interested in data processing systems and have also worked on systems for transaction processing (OLTP) workloads. Replication of data in distributed data stores presents a fundamental trade-off between latency and consistency. In Probabilistically Bounded Staleness (PBS) [2], we introduced a consistency model which provides expected bounds on data staleness with respect to wall-clock time. Using PBS, we were able to measure the latency-consistency trade-off for quorum-based systems like Cassandra [1]. Sharing distributed storage systems while meeting service-level objectives (SLOs) for both throughput- and latency-sensitive applications is also challenging. In our work on Cake [15], we developed a coordinated, multi-resource scheduler that enforces SLOs in shared distributed storage systems.

Prior to my research at Berkeley, I studied the design of storage systems for non-volatile, byte-addressable memory as part of my master's thesis [8] at UIUC. My work proposed Consistent and Durable Data Structures (CDDSs) [9], a single-level data storage design that allows programmers to safely exploit the low latency and non-volatility of new memory technologies.

Future Research

Future research at the intersection of computing systems and machine learning algorithms is necessary to handle changes in hardware and the evolution of workloads. Some of the research problems I plan to tackle include:

Declarative Abstractions for Machine Learning: Designing large-scale machine learning applications is challenging and requires intricate knowledge of the domain, statistics, and systems characteristics. Further, implementations of machine learning applications are low-level and describe how they should be executed rather than what should be computed. The main reason for this is that the appropriate machine learning algorithm to use often changes based on the data and hardware at hand. The adoption of heterogeneous hardware like accelerators, and of asynchronous algorithms (e.g., HOGWILD!), further exacerbates this problem. My goal is to develop declarative machine learning abstractions that can capture the intent of a wide variety of applications, and correspondingly to build systems that automatically optimize execution for various hardware targets. Our work on the KeystoneML optimizer [5] is an initial step towards this goal. I also plan to investigate new adaptable algorithms that can tune their execution based on cost and latency requirements. In my recent work on Hemingway [4], we proposed techniques to model the convergence rates of machine learning algorithms; this approach can be used to develop adaptable algorithms in the future, as sketched below.

Systems for Heterogeneous Hardware: The evolution of datacenter hardware is leading to the adoption of technologies like GPUs, non-volatile memory (NVM), battery-backed DRAM, and disaggregated 100Gbps networks. These...
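To make the convergence-modeling idea concrete, here is a minimal sketch of the general approach, under my own assumptions rather than Hemingway's actual model [4]: fit a simple parametric convergence curve to the observed training loss and invert it to predict how many iterations a configuration needs to reach a target loss. The curve family loss(t) ≈ a/√t + b and the function names are illustrative assumptions.

```python
# A toy sketch of convergence-rate modeling in the spirit of Hemingway [4].
# The curve family and fitting procedure are illustrative assumptions,
# not the system's actual model.
import numpy as np

def fit_convergence_curve(iters, losses):
    """Fit loss(t) ~= a / sqrt(t) + b by least squares.

    This matches the O(1/sqrt(t)) rate typical of SGD on convex problems.
    """
    X = np.column_stack([1.0 / np.sqrt(iters), np.ones_like(iters)])
    (a, b), *_ = np.linalg.lstsq(X, losses, rcond=None)
    return a, b

def iters_to_reach(target_loss, a, b):
    """Invert the fitted curve: smallest t with a/sqrt(t) + b <= target."""
    if target_loss <= b:
        return np.inf  # curve predicts the target is unreachable
    return int(np.ceil((a / (target_loss - b)) ** 2))

# Hypothetical measurements from the first 100 iterations of a run.
t = np.arange(1, 101, dtype=float)
rng = np.random.default_rng(0)
loss = 2.0 / np.sqrt(t) + 0.1 + rng.normal(0, 0.01, size=t.shape)

a, b = fit_convergence_curve(t, loss)
print(f"predicted iterations to reach loss 0.15: {iters_to_reach(0.15, a, b)}")
```

An adaptable algorithm could refit such a curve online and change its configuration whenever the predicted time to the target violates a cost or latency requirement.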
Similar Resources
Learning Useful Abstractions from the Web
Background/Objective: The successful application of machine learning to electronic medical records typically turns on the construction of an appropriate feature vector. Defining abstractions to create high-level, lower-dimensional feature vectors can help in identifying clinically meaningful similarities among patients, especially when the number of training examples is limited. In this work, we...
Machine Learning for Software Reuse
Recent work on learning apprentice systems suggests new approaches for using interactive programming environments to promote software reuse. Methodologies for software specification and validation yield natural domains of application for explanation-based learning techniques. This paper develops a relation between data abstractions in software and explanation-based generalization problems and sh...
Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces
With the rise of machine learning, there is a great deal of interest in treating programs as data to be fed to learning algorithms. However, programs do not start off in a form that is immediately amenable to most off-the-shelf learning techniques. Instead, it is necessary to transform the program to a suitable representation before a learning technique can be applied. In this paper, we use abs...
The utility of temporal abstraction in reinforcement learning
The hierarchical structure of real-world problems has motivated extensive research into temporal abstractions for reinforcement learning, but precisely how these abstractions allow agents to improve their learning performance is not well understood. This paper investigates the connection between temporal abstraction and an agent’s exploration policy, which determines how the agent’s performance...
Invented Predicates to Reduce Knowledge Acquisition
The aim of this study was to develop machine learning techniques that would facilitate knowledge acquisition from an expert by taking over the knowledge engineering task of identifying intermediate abstractions. As the expert provided knowledge, the system would generalize from this knowledge and use the abstractions it learned in order to reduce the need for later knowledge acquisition. This ge...
Learning-Based Abstractions for Nonlinear Constraint Solving
We propose a new abstraction refinement procedure based on machine learning to improve the performance of nonlinear constraint solving algorithms on large-scale problems. The proposed approach decomposes the original set of constraints into smaller subsets, and uses learning algorithms to propose sequences of abstractions that take the form of conjunctions of classifiers. The core procedure is ...